INTRODUCTION
Airbnb is a community platform service that connects travelers with hotel companies, rental property investors, and individuals who rent out all or part of their own home as a spare home. The site offers a search and booking platform between the person offering their accommodation and a renter. It covers more than 1.5 million rental ads in over 34,000 cities and 191 countries. In our study, we will restrict ourselves to the city of Paris.
We will use Machine Learning algorithms to predict the price of an Airbnb rental in Paris.
The use case of our work would be :
Data source : http://insideairbnb.com/get-the-data.html
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
%load_ext autoreload
%autoreload 2
%matplotlib inline
The autoreload extension is already loaded. To reload it, use: %reload_ext autoreload
!pip3 install --quiet -r requirements.txt
!jupyter nbextension enable --py widgetsnbextension
!jupyter trust Airbnb_ML.ipynb
Enabling notebook extension jupyter-js-widgets/extension...
- Validating: ok
Notebook already signed: Airbnb_ML.ipynb
import os
import sys
print("Python version: {}". format(sys.version))
import IPython
from IPython import display
from IPython.display import Markdown, display, HTML
print("IPython version: {}". format(IPython.__version__))
import numpy as np
print("NumPy version: {}". format(np.__version__))
import scipy as sp
from scipy import stats
print("SciPy version: {}". format(sp.__version__))
import pandas as pd
print("pandas version: {}". format(pd.__version__))
import matplotlib
import matplotlib.pyplot as plt
print("matplotlib version: {}". format(matplotlib.__version__))
import seaborn as sns
print("seaborn version : {}". format(sns.__version__))
sns.set()
import missingno as msno
print("missingno version : {}". format(msno.__version__))
import pandas_profiling
print("pandas_profiling version : {}". format(pandas_profiling.__version__))
import folium
print("folium version : {}". format(folium.__version__))
import wordcloud
print("wordcloud version : {}". format(wordcloud.__version__))
import ipywidgets
print("ipywidgets version : {}". format(ipywidgets.__version__))
import sklearn
print("scikit-learn version : {}". format(sklearn.__version__))
print("====================================")
Python version: 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)] IPython version: 7.27.0 NumPy version: 1.20.3 SciPy version: 1.7.1 pandas version: 1.3.2 matplotlib version: 3.4.2 seaborn version : 0.11.2 missingno version : 0.5.0 pandas_profiling version : 3.1.0 folium version : 0.12.1.post1 wordcloud version : 1.8.1 ipywidgets version : 7.6.5 scikit-learn version : 1.0.1 ====================================
def printmd(texte, couleur=None):
"""
Printing of the string texte by choosing the style (for example bold) and the color (couleur)
"""
colorstr = "<span style='color:{}'>{}</span>".format(couleur, texte)
display(Markdown(colorstr))
import zipfile
zip_file = "./dataset/airbnb_compressed.zip"
try:
with zipfile.ZipFile(zip_file) as z:
z.extractall("./dataset")
print("Extracted all")
except:
print("Invalid file")
Extracted all
# df_main => cleaned summary of the dataset, good for visualisations
df_main = pd.read_csv('./dataset/airbnb_paris.csv', index_col='id')
# df_listings => detailed dataset with all collected columns
df_listings = pd.read_csv('./dataset/listings.csv', index_col='id')
The goal of this part is to compute exploratory data analysis on the dataset to have a good overview of it and gain information.
As some part require accurate information, we also do some little preprocessing to have the best exploratory analysis possible.
data_folder = "dataset"
file_name = "Inside_Airbnb_Data_Dictionary.csv"
dictionary_file_path = os.path.join(".", data_folder, file_name)
separateur = ";"
pd.options.display.max_rows = 80
df_dictionary = pd.read_csv(dictionary_file_path, sep=separateur, index_col="Field")
df_dictionary[["Type", "Description"]]
| Type | Description | |
|---|---|---|
| Field | ||
| id | integer | Airbnb's unique identifier for the listing |
| listing_url | text | NaN |
| scrape_id | bigint | Inside Airbnb "Scrape" this was part of |
| last_scraped | datetime | UTC. The date and time this listing was "scrap... |
| name | text | Name of the listing |
| description | text | Detailed description of the listing |
| neighborhood_overview | text | Host's description of the neighbourhood |
| picture_url | text | URL to the Airbnb hosted regular sized image f... |
| host_id | integer | Airbnb's unique identifier for the host/user |
| host_url | text | The Airbnb page for the host |
| host_name | text | Name of the host. Usually just the first name(s). |
| host_since | date | The date the host/user was created. For hosts ... |
| host_location | text | The host's self reported location |
| host_about | text | Description about the host |
| host_response_time | NaN | NaN |
| host_response_rate | NaN | NaN |
| host_acceptance_rate | NaN | That rate at which a host accepts booking requ... |
| host_is_superhost | boolean [t=true; f=false] | NaN |
| host_thumbnail_url | text | NaN |
| host_picture_url | text | NaN |
| host_neighbourhood | text | NaN |
| host_listings_count | text | The number of listings the host has (per Airbn... |
| host_total_listings_count | text | The number of listings the host has (per Airbn... |
| host_verifications | NaN | NaN |
| host_has_profile_pic | boolean [t=true; f=false] | NaN |
| host_identity_verified | boolean [t=true; f=false] | NaN |
| neighbourhood | text | NaN |
| neighbourhood_cleansed | text | The neighbourhood as geocoded using the latitu... |
| neighbourhood_group_cleansed | text | The neighbourhood group as geocoded using the ... |
| latitude | numeric | Uses the World Geodetic System (WGS84) project... |
| longitude | numeric | Uses the World Geodetic System (WGS84) project... |
| property_type | text | Self selected property type. Hotels and Bed an... |
| room_type | text | [Entire home/apt|Private room|Shared room|Hote... |
| accommodates | integer | The maximum capacity of the listing |
| bathrooms | numeric | The number of bathrooms in the listing |
| bathrooms_text | string | The number of bathrooms in the listing. \nOn t... |
| bedrooms | integer | The number of bedrooms |
| beds | integer | The number of bed(s) |
| amenities | json | NaN |
| price | currency | daily price in local currency |
| minimum_nights | integer | minimum number of night stay for the listing (... |
| maximum_nights | integer | maximum number of night stay for the listing (... |
| minimum_minimum_nights | integer | the smallest minimum_night value from the cale... |
| maximum_minimum_nights | integer | the largest minimum_night value from the calen... |
| minimum_maximum_nights | integer | the smallest maximum_night value from the cale... |
| maximum_maximum_nights | integer | the largest maximum_night value from the calen... |
| minimum_nights_avg_ntm | numeric | the average minimum_night value from the calen... |
| maximum_nights_avg_ntm | numeric | the average maximum_night value from the calen... |
| calendar_updated | date | NaN |
| has_availability | boolean | [t=true; f=false] |
| availability_30 | integer | avaliability_x. The availability of the listin... |
| availability_60 | integer | avaliability_x. The availability of the listin... |
| availability_90 | integer | avaliability_x. The availability of the listin... |
| availability_365 | integer | avaliability_x. The availability of the listin... |
| calendar_last_scraped | date | NaN |
| number_of_reviews | integer | The number of reviews the listing has |
| number_of_reviews_ltm | integer | The number of reviews the listing has (in the ... |
| number_of_reviews_l30d | integer | The number of reviews the listing has (in the ... |
| first_review | date | The date of the first/oldest review |
| last_review | date | The date of the last/newest review |
| review_scores_rating | NaN | NaN |
| review_scores_accuracy | NaN | NaN |
| review_scores_cleanliness | NaN | NaN |
| review_scores_checkin | NaN | NaN |
| review_scores_communication | NaN | NaN |
| review_scores_location | NaN | NaN |
| review_scores_value | NaN | NaN |
| license | text | The licence/permit/registration number |
| instant_bookable | boolean | [t=true; f=false]. Whether the guest can autom... |
| calculated_host_listings_count | integer | The number of listings the host has in the cur... |
| calculated_host_listings_count_entire_homes | integer | The number of Entire home/apt listings the hos... |
| calculated_host_listings_count_private_rooms | integer | The number of Private room listings the host h... |
| calculated_host_listings_count_shared_rooms | integer | The number of Shared room listings the host ha... |
| reviews_per_month | numeric | The number of reviews the listing has over the... |
df_main.head()
| name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | license | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||||||
| 5396 | Explore the heart of old Paris | 7903 | Borzou | NaN | Hôtel-de-Ville | 48.85247 | 2.35835 | Entire home/apt | 102 | 2 | 273 | 2021-12-08 | 1.80 | 1 | 58 | 42 | 7510402838018 |
| 7397 | MARAIS - 2ROOMS APT - 2/4 PEOPLE | 2626 | Franck | NaN | Hôtel-de-Ville | 48.85909 | 2.35315 | Entire home/apt | 112 | 10 | 288 | 2021-12-05 | 2.22 | 2 | 209 | 19 | 7510400829623 |
| 7964 | Large & sunny flat with balcony ! | 22155 | Anaïs | NaN | Opéra | 48.87417 | 2.34245 | Entire home/apt | 130 | 6 | 6 | 2015-09-14 | 0.04 | 1 | 344 | 0 | 7510903576564 |
| 9359 | Cozy, Central Paris: WALK or VELIB EVERYWHERE ! | 28422 | Bernadette | NaN | Louvre | 48.86006 | 2.34863 | Entire home/apt | 75 | 180 | 0 | NaN | NaN | 1 | 117 | 0 | Available with a mobility lease only ("bail mo... |
| 9952 | Paris petit coin douillet | 33534 | Elisabeth | NaN | Popincourt | 48.86373 | 2.37093 | Entire home/apt | 81 | 4 | 33 | 2021-10-22 | 0.31 | 1 | 260 | 7 | 7511101582862 |
df_main.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 49429 entries, 5396 to 53712667 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 name 49384 non-null object 1 host_id 49429 non-null int64 2 host_name 49397 non-null object 3 neighbourhood_group 0 non-null float64 4 neighbourhood 49429 non-null object 5 latitude 49429 non-null float64 6 longitude 49429 non-null float64 7 room_type 49429 non-null object 8 price 49429 non-null int64 9 minimum_nights 49429 non-null int64 10 number_of_reviews 49429 non-null int64 11 last_review 38767 non-null object 12 reviews_per_month 38767 non-null float64 13 calculated_host_listings_count 49429 non-null int64 14 availability_365 49429 non-null int64 15 number_of_reviews_ltm 49429 non-null int64 16 license 28122 non-null object dtypes: float64(4), int64(7), object(6) memory usage: 6.8+ MB
We notice that the attribute "neighbourhood_group" is not filled in
df_main.describe()
| host_id | neighbourhood_group | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4.942900e+04 | 0.0 | 49429.000000 | 49429.000000 | 49429.000000 | 49429.000000 | 49429.000000 | 38767.000000 | 49429.000000 | 49429.000000 | 49429.000000 |
| mean | 9.896725e+07 | NaN | 48.864008 | 2.344914 | 129.891986 | 109.405086 | 21.636590 | 0.796471 | 10.652188 | 99.513059 | 3.635801 |
| std | 1.208700e+08 | NaN | 0.018181 | 0.033072 | 217.186069 | 168.774442 | 46.339395 | 1.191827 | 35.528471 | 134.092923 | 13.033774 |
| min | 2.626000e+03 | NaN | 48.812580 | 2.225720 | 0.000000 | 1.000000 | 0.000000 | 0.010000 | 1.000000 | 0.000000 | 0.000000 |
| 25% | 1.340742e+07 | NaN | 48.850820 | 2.324110 | 61.000000 | 2.000000 | 1.000000 | 0.120000 | 1.000000 | 0.000000 | 0.000000 |
| 50% | 3.883365e+07 | NaN | 48.865320 | 2.347900 | 90.000000 | 4.000000 | 6.000000 | 0.370000 | 1.000000 | 3.000000 | 0.000000 |
| 75% | 1.470187e+08 | NaN | 48.878530 | 2.369270 | 140.000000 | 365.000000 | 21.000000 | 1.000000 | 2.000000 | 198.000000 | 3.000000 |
| max | 4.350258e+08 | NaN | 48.904860 | 2.472030 | 10250.000000 | 9999.000000 | 1809.000000 | 50.860000 | 252.000000 | 365.000000 | 1705.000000 |
# https://python-visualization.github.io/folium/
import folium, folium.plugins
paris_location = [48.86, 2.34]
map = folium.Map(location=paris_location, zoom_start=12)
folium.plugins.FastMarkerCluster(data=df_main[['latitude', 'longitude']]).add_to(map)
map.save( 'map.html')
map
def generate_wordcloud(df,col):
content = ''
stopwords = set(STOPWORDS)
for val in df[col].values:
val = str(val)
tokens = val.split()
for i in range(len(tokens)):
tokens[i] = tokens[i].lower()
content += " ".join(tokens)+" "
wordcloud = WordCloud(width = 800, height = 800,
background_color ='white',
stopwords = stopwords,
min_font_size = 10).generate(content)
# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
%matplotlib inline
from wordcloud import WordCloud, STOPWORDS
generate_wordcloud(df_main,'name')
# the bigger the word, the more frequent it is in this field
We can see that the description of the apartments listed in our dataset are skewed towards upper class paris areas. We can also see that there is especially succesful appartment quality adjectives like : cosy, bright, charming etc
generate_wordcloud(df_main,'neighbourhood')
df_listings.head()
| listing_url | scrape_id | last_scraped | name | description | neighborhood_overview | picture_url | host_id | host_url | host_name | ... | review_scores_communication | review_scores_location | review_scores_value | license | instant_bookable | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||||||||||
| 5396 | https://www.airbnb.com/rooms/5396 | 20211207182322 | 2021-12-09 | Explore the heart of old Paris | Cozy, well-appointed and graciously designed s... | You are within walking distance to the Louvre,... | https://a0.muscache.com/pictures/52413/f9bf76f... | 7903 | https://www.airbnb.com/users/show/7903 | Borzou | ... | 4.81 | 4.96 | 4.53 | 7510402838018 | f | 1 | 1 | 0 | 0 | 1.80 |
| 7397 | https://www.airbnb.com/rooms/7397 | 20211207182322 | 2021-12-09 | MARAIS - 2ROOMS APT - 2/4 PEOPLE | VERY CONVENIENT, WITH THE BEST LOCATION !<br /... | NaN | https://a0.muscache.com/pictures/67928287/330b... | 2626 | https://www.airbnb.com/users/show/2626 | Franck | ... | 4.88 | 4.92 | 4.71 | 7510400829623 | f | 2 | 2 | 0 | 0 | 2.22 |
| 7964 | https://www.airbnb.com/rooms/7964 | 20211207182322 | 2021-12-08 | Large & sunny flat with balcony ! | Very large & nice apartment all for you! <br /... | NaN | https://a0.muscache.com/pictures/4471349/6fb3d... | 22155 | https://www.airbnb.com/users/show/22155 | Anaïs | ... | 5.00 | 5.00 | 5.00 | 7510903576564 | f | 1 | 1 | 0 | 0 | 0.04 |
| 9359 | https://www.airbnb.com/rooms/9359 | 20211207182322 | 2021-12-09 | Cozy, Central Paris: WALK or VELIB EVERYWHERE ! | Location! Location! Location! Just bring your ... | NaN | https://a0.muscache.com/pictures/c2965945-061f... | 28422 | https://www.airbnb.com/users/show/28422 | Bernadette | ... | NaN | NaN | NaN | Available with a mobility lease only ("bail mo... | f | 1 | 1 | 0 | 0 | NaN |
| 9952 | https://www.airbnb.com/rooms/9952 | 20211207182322 | 2021-12-09 | Paris petit coin douillet | Je suis une dame retraitée, qui propose un agr... | Vibrant neighborhood, full of bars, cafés, fre... | https://a0.muscache.com/pictures/ae822d16-74d2... | 33534 | https://www.airbnb.com/users/show/33534 | Elisabeth | ... | 4.91 | 4.91 | 4.94 | 7511101582862 | f | 1 | 1 | 0 | 0 | 0.31 |
5 rows × 73 columns
df_listings.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 49429 entries, 5396 to 53712667 Data columns (total 73 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 listing_url 49429 non-null object 1 scrape_id 49429 non-null int64 2 last_scraped 49429 non-null object 3 name 49384 non-null object 4 description 48706 non-null object 5 neighborhood_overview 29455 non-null object 6 picture_url 49428 non-null object 7 host_id 49429 non-null int64 8 host_url 49429 non-null object 9 host_name 49397 non-null object 10 host_since 49397 non-null object 11 host_location 49303 non-null object 12 host_about 25075 non-null object 13 host_response_time 21820 non-null object 14 host_response_rate 21820 non-null object 15 host_acceptance_rate 24605 non-null object 16 host_is_superhost 49397 non-null object 17 host_thumbnail_url 49397 non-null object 18 host_picture_url 49397 non-null object 19 host_neighbourhood 35037 non-null object 20 host_listings_count 49397 non-null float64 21 host_total_listings_count 49397 non-null float64 22 host_verifications 49429 non-null object 23 host_has_profile_pic 49397 non-null object 24 host_identity_verified 49397 non-null object 25 neighbourhood 29455 non-null object 26 neighbourhood_cleansed 49429 non-null object 27 neighbourhood_group_cleansed 0 non-null float64 28 latitude 49429 non-null float64 29 longitude 49429 non-null float64 30 property_type 49429 non-null object 31 room_type 49429 non-null object 32 accommodates 49429 non-null int64 33 bathrooms 0 non-null float64 34 bathrooms_text 49274 non-null object 35 bedrooms 40293 non-null float64 36 beds 47485 non-null float64 37 amenities 49429 non-null object 38 price 49429 non-null object 39 minimum_nights 49429 non-null int64 40 maximum_nights 49429 non-null int64 41 minimum_minimum_nights 49424 non-null float64 42 maximum_minimum_nights 49424 non-null float64 43 minimum_maximum_nights 49424 non-null float64 44 maximum_maximum_nights 49424 non-null float64 45 minimum_nights_avg_ntm 49424 non-null float64 46 maximum_nights_avg_ntm 49424 non-null float64 47 calendar_updated 0 non-null float64 48 has_availability 49429 non-null object 49 availability_30 49429 non-null int64 50 availability_60 49429 non-null int64 51 availability_90 49429 non-null int64 52 availability_365 49429 non-null int64 53 calendar_last_scraped 49429 non-null object 54 number_of_reviews 49429 non-null int64 55 number_of_reviews_ltm 49429 non-null int64 56 number_of_reviews_l30d 49429 non-null int64 57 first_review 38767 non-null object 58 last_review 38767 non-null object 59 review_scores_rating 38767 non-null float64 60 review_scores_accuracy 38119 non-null float64 61 review_scores_cleanliness 38124 non-null float64 62 review_scores_checkin 38105 non-null float64 63 review_scores_communication 38117 non-null float64 64 review_scores_location 38103 non-null float64 65 review_scores_value 38100 non-null float64 66 license 28122 non-null object 67 instant_bookable 49429 non-null object 68 calculated_host_listings_count 49429 non-null int64 69 calculated_host_listings_count_entire_homes 49429 non-null int64 70 calculated_host_listings_count_private_rooms 49429 non-null int64 71 calculated_host_listings_count_shared_rooms 49429 non-null int64 72 reviews_per_month 38767 non-null float64 dtypes: float64(23), int64(16), object(34) memory usage: 27.9+ MB
df_listings.describe()
| scrape_id | host_id | host_listings_count | host_total_listings_count | neighbourhood_group_cleansed | latitude | longitude | accommodates | bathrooms | bedrooms | ... | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4.942900e+04 | 4.942900e+04 | 49397.000000 | 49397.000000 | 0.0 | 49429.000000 | 49429.000000 | 49429.000000 | 0.0 | 40293.000000 | ... | 38124.00000 | 38105.000000 | 38117.000000 | 38103.000000 | 38100.000000 | 49429.000000 | 49429.000000 | 49429.000000 | 49429.000000 | 38767.000000 |
| mean | 2.021121e+13 | 9.896725e+07 | 13.513149 | 13.513149 | NaN | 48.864008 | 2.344914 | 3.044346 | NaN | 1.369990 | ... | 4.59638 | 4.802476 | 4.813442 | 4.805913 | 4.618081 | 10.652188 | 9.945579 | 0.460904 | 0.019604 | 0.796471 |
| std | 1.172668e+01 | 1.208700e+08 | 83.562486 | 83.562486 | NaN | 0.018181 | 0.033072 | 1.622670 | NaN | 1.015249 | ... | 0.53207 | 0.399383 | 0.401767 | 0.348673 | 0.464296 | 35.528471 | 35.404664 | 2.101099 | 0.306851 | 1.191827 |
| min | 2.021121e+13 | 2.626000e+03 | 0.000000 | 0.000000 | NaN | 48.812580 | 2.225720 | 0.000000 | NaN | 1.000000 | ... | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.010000 |
| 25% | 2.021121e+13 | 1.340742e+07 | 1.000000 | 1.000000 | NaN | 48.850820 | 2.324110 | 2.000000 | NaN | 1.000000 | ... | 4.46000 | 4.760000 | 4.790000 | 4.750000 | 4.500000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.120000 |
| 50% | 2.021121e+13 | 3.883365e+07 | 1.000000 | 1.000000 | NaN | 48.865320 | 2.347900 | 2.000000 | NaN | 1.000000 | ... | 4.75000 | 4.930000 | 4.950000 | 4.920000 | 4.720000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.370000 |
| 75% | 2.021121e+13 | 1.470187e+08 | 2.000000 | 2.000000 | NaN | 48.878530 | 2.369270 | 4.000000 | NaN | 2.000000 | ... | 4.97000 | 5.000000 | 5.000000 | 5.000000 | 4.900000 | 2.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 |
| max | 2.021121e+13 | 4.350258e+08 | 1288.000000 | 1288.000000 | NaN | 48.904860 | 2.472030 | 16.000000 | NaN | 50.000000 | ... | 5.00000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 252.000000 | 249.000000 | 30.000000 | 10.000000 | 50.860000 |
8 rows × 39 columns
df_main.columns
Index(['name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood',
'latitude', 'longitude', 'room_type', 'price', 'minimum_nights',
'number_of_reviews', 'last_review', 'reviews_per_month',
'calculated_host_listings_count', 'availability_365',
'number_of_reviews_ltm', 'license'],
dtype='object')
# display columns that are not already in df_main
for elt in df_listings.columns:
if elt not in df_main.columns:
print(elt, "(",df_listings[elt].dtype, ")", end=', ')
listing_url ( object ), scrape_id ( int64 ), last_scraped ( object ), description ( object ), neighborhood_overview ( object ), picture_url ( object ), host_url ( object ), host_since ( object ), host_location ( object ), host_about ( object ), host_response_time ( object ), host_response_rate ( object ), host_acceptance_rate ( object ), host_is_superhost ( object ), host_thumbnail_url ( object ), host_picture_url ( object ), host_neighbourhood ( object ), host_listings_count ( float64 ), host_total_listings_count ( float64 ), host_verifications ( object ), host_has_profile_pic ( object ), host_identity_verified ( object ), neighbourhood_cleansed ( object ), neighbourhood_group_cleansed ( float64 ), property_type ( object ), accommodates ( int64 ), bathrooms ( float64 ), bathrooms_text ( object ), bedrooms ( float64 ), beds ( float64 ), amenities ( object ), maximum_nights ( int64 ), minimum_minimum_nights ( float64 ), maximum_minimum_nights ( float64 ), minimum_maximum_nights ( float64 ), maximum_maximum_nights ( float64 ), minimum_nights_avg_ntm ( float64 ), maximum_nights_avg_ntm ( float64 ), calendar_updated ( float64 ), has_availability ( object ), availability_30 ( int64 ), availability_60 ( int64 ), availability_90 ( int64 ), calendar_last_scraped ( object ), number_of_reviews_l30d ( int64 ), first_review ( object ), review_scores_rating ( float64 ), review_scores_accuracy ( float64 ), review_scores_cleanliness ( float64 ), review_scores_checkin ( float64 ), review_scores_communication ( float64 ), review_scores_location ( float64 ), review_scores_value ( float64 ), instant_bookable ( object ), calculated_host_listings_count_entire_homes ( int64 ), calculated_host_listings_count_private_rooms ( int64 ), calculated_host_listings_count_shared_rooms ( int64 ),
# Data dictionary : https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit#gid=982310896
listings_cols_to_keep = ["host_is_superhost",
"accommodates",
"bedrooms",
"beds",
"availability_60",
"number_of_reviews_l30d",
"availability_90",
"review_scores_accuracy",
"review_scores_cleanliness",
"review_scores_checkin",
"review_scores_communication",
"review_scores_location",
"review_scores_value"
# TODO Find More to add
]
df = pd.merge(df_main, df_listings[listings_cols_to_keep], on='id', how='left')
df
| name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | ... | beds | availability_60 | number_of_reviews_l30d | availability_90 | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||||||||||
| 5396 | Explore the heart of old Paris | 7903 | Borzou | NaN | Hôtel-de-Ville | 48.852470 | 2.358350 | Entire home/apt | 102 | 2 | ... | 1.0 | 28 | 3 | 58 | 4.56 | 4.48 | 4.77 | 4.81 | 4.96 | 4.53 |
| 7397 | MARAIS - 2ROOMS APT - 2/4 PEOPLE | 2626 | Franck | NaN | Hôtel-de-Ville | 48.859090 | 2.353150 | Entire home/apt | 112 | 10 | ... | 2.0 | 5 | 2 | 24 | 4.79 | 4.43 | 4.91 | 4.88 | 4.92 | 4.71 |
| 7964 | Large & sunny flat with balcony ! | 22155 | Anaïs | NaN | Opéra | 48.874170 | 2.342450 | Entire home/apt | 130 | 6 | ... | 1.0 | 39 | 0 | 69 | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 |
| 9359 | Cozy, Central Paris: WALK or VELIB EVERYWHERE ! | 28422 | Bernadette | NaN | Louvre | 48.860060 | 2.348630 | Entire home/apt | 75 | 180 | ... | 1.0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 9952 | Paris petit coin douillet | 33534 | Elisabeth | NaN | Popincourt | 48.863730 | 2.370930 | Entire home/apt | 81 | 4 | ... | 1.0 | 3 | 0 | 25 | 4.97 | 4.88 | 5.00 | 4.91 | 4.91 | 4.94 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53703809 | Romantic Saint-Germain-des-Prés & Odeon ➳ NEW | 23834078 | Daniela | NaN | Luxembourg | 48.851410 | 2.338214 | Entire home/apt | 290 | 30 | ... | 4.0 | 54 | 0 | 84 | NaN | NaN | NaN | NaN | NaN | NaN |
| 53711459 | CALM New deco APT.with a inner yard | 418882378 | Eléo | NaN | Opéra | 48.877367 | 2.341647 | Entire home/apt | 88 | 1 | ... | 1.0 | 39 | 0 | 69 | NaN | NaN | NaN | NaN | NaN | NaN |
| 53711747 | Studio PAISIBLE for 2 in PARIS 20ème | 435019549 | Yassine | NaN | Ménilmontant | 48.864185 | 2.404030 | Entire home/apt | 70 | 1 | ... | 1.0 | 43 | 0 | 73 | NaN | NaN | NaN | NaN | NaN | NaN |
| 53712582 | Superb flat in the heart of Paris : Amelie’s h... | 6174398 | Laure-Anne | NaN | Buttes-Montmartre | 48.885458 | 2.339471 | Entire home/apt | 150 | 2 | ... | 1.0 | 4 | 0 | 11 | NaN | NaN | NaN | NaN | NaN | NaN |
| 53712667 | Apt refurbished near LES CATACOMBES DE PARIS | 435025752 | Muriel | NaN | Gobelins | 48.832718 | 2.343418 | Entire home/apt | 97 | 1 | ... | 2.0 | 1 | 0 | 1 | NaN | NaN | NaN | NaN | NaN | NaN |
49429 rows × 30 columns
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 49429 entries, 5396 to 53712667 Data columns (total 30 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 name 49384 non-null object 1 host_id 49429 non-null int64 2 host_name 49397 non-null object 3 neighbourhood_group 0 non-null float64 4 neighbourhood 49429 non-null object 5 latitude 49429 non-null float64 6 longitude 49429 non-null float64 7 room_type 49429 non-null object 8 price 49429 non-null int64 9 minimum_nights 49429 non-null int64 10 number_of_reviews 49429 non-null int64 11 last_review 38767 non-null object 12 reviews_per_month 38767 non-null float64 13 calculated_host_listings_count 49429 non-null int64 14 availability_365 49429 non-null int64 15 number_of_reviews_ltm 49429 non-null int64 16 license 28122 non-null object 17 host_is_superhost 49397 non-null object 18 accommodates 49429 non-null int64 19 bedrooms 40293 non-null float64 20 beds 47485 non-null float64 21 availability_60 49429 non-null int64 22 number_of_reviews_l30d 49429 non-null int64 23 availability_90 49429 non-null int64 24 review_scores_accuracy 38119 non-null float64 25 review_scores_cleanliness 38124 non-null float64 26 review_scores_checkin 38105 non-null float64 27 review_scores_communication 38117 non-null float64 28 review_scores_location 38103 non-null float64 29 review_scores_value 38100 non-null float64 dtypes: float64(12), int64(11), object(7) memory usage: 11.7+ MB
df.describe()
| host_id | neighbourhood_group | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 | ... | beds | availability_60 | number_of_reviews_l30d | availability_90 | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4.942900e+04 | 0.0 | 49429.000000 | 49429.000000 | 49429.000000 | 49429.000000 | 49429.000000 | 38767.000000 | 49429.000000 | 49429.000000 | ... | 47485.000000 | 49429.000000 | 49429.000000 | 49429.000000 | 38119.000000 | 38124.00000 | 38105.000000 | 38117.000000 | 38103.000000 | 38100.000000 |
| mean | 9.896725e+07 | NaN | 48.864008 | 2.344914 | 129.891986 | 109.405086 | 21.636590 | 0.796471 | 10.652188 | 99.513059 | ... | 1.729367 | 13.859698 | 0.458051 | 23.360922 | 4.760520 | 4.59638 | 4.802476 | 4.813442 | 4.805913 | 4.618081 |
| std | 1.208700e+08 | NaN | 0.018181 | 0.033072 | 217.186069 | 168.774442 | 46.339395 | 1.191827 | 35.528471 | 134.092923 | ... | 1.354777 | 20.322719 | 1.391720 | 32.375636 | 0.422341 | 0.53207 | 0.399383 | 0.401767 | 0.348673 | 0.464296 |
| min | 2.626000e+03 | NaN | 48.812580 | 2.225720 | 0.000000 | 1.000000 | 0.000000 | 0.010000 | 1.000000 | 0.000000 | ... | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 1.340742e+07 | NaN | 48.850820 | 2.324110 | 61.000000 | 2.000000 | 1.000000 | 0.120000 | 1.000000 | 0.000000 | ... | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 4.700000 | 4.46000 | 4.760000 | 4.790000 | 4.750000 | 4.500000 |
| 50% | 3.883365e+07 | NaN | 48.865320 | 2.347900 | 90.000000 | 4.000000 | 6.000000 | 0.370000 | 1.000000 | 3.000000 | ... | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 4.880000 | 4.75000 | 4.930000 | 4.950000 | 4.920000 | 4.720000 |
| 75% | 1.470187e+08 | NaN | 48.878530 | 2.369270 | 140.000000 | 365.000000 | 21.000000 | 1.000000 | 2.000000 | 198.000000 | ... | 2.000000 | 27.000000 | 0.000000 | 51.000000 | 5.000000 | 4.97000 | 5.000000 | 5.000000 | 5.000000 | 4.900000 |
| max | 4.350258e+08 | NaN | 48.904860 | 2.472030 | 10250.000000 | 9999.000000 | 1809.000000 | 50.860000 | 252.000000 | 365.000000 | ... | 90.000000 | 60.000000 | 62.000000 | 90.000000 | 5.000000 | 5.00000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 |
8 rows × 23 columns
msno.matrix(df)
<AxesSubplot:>
df['neighbourhood_group'].unique()
array([nan])
We note that the neighbourhood_group has only one unique value which is nan. As a result, this column is not filled in, we then delete it.
del df['neighbourhood_group']
df['license'].isna().sum() / len(df)
0.431062736450262
Moreover,license holds more than half of its content as missing values. Being a column bearing few meaning we choosed to delete it.
del df['license']
print("The last_review column (The date of the last/newest review) has {:.2%} missing data." \
.format(df['last_review'].isna().sum() / len(df)))
The last_review column (The date of the last/newest review) has 21.57% missing data.
In addition, a relevant default date cannot be specified. We choosed to delete it.
del df['last_review']
df['host_name'] = df['host_name'].fillna("NO_NAME")
def process_empty(x):
x = x.replace(' ','')
return x if x!='' else np.NaN
for col in df.columns :
try:
df[col] = df[col].apply(process_empty)
except:
pass
data_type = {
'int':['id','price','host_id','minimum_nights','calculated_host_listings_count',
'availability_365','number_of_reviews_ltm'],
'float':['latitude','longitude','reviews_per_month'],
'datetime':['last_review'],
'object':['name','host_name','room_type']
}
for col in df.columns:
for key, values in data_type.items():
if col in values:
if key == 'datetime':
df[col] = pd.to_datetime(df[col])
df['last_review'] = df['last_review'].dt.strftime('%d-%m-%Y')
else:
df[col] = df[col].astype(key)
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)
df['host_is_superhost'] = df['host_is_superhost'].fillna('f')
# df_listings[df_listings['bedrooms'].isna()]['listing_url']
# quand bedrooms = NaN, pas de chambre, ex studio
df['bedrooms'] = df['bedrooms'].fillna(0)
df['beds'] = df['beds'].fillna(1)
df['name'] = df['name'].fillna(" ")
scores = ['accuracy', 'cleanliness', 'checkin', 'communication', 'location', 'value']
df['review_scores_' + scores[0]].isna().sum() / df.shape[0]
for elt in scores:
index_with_nan = df.index[df[['review_scores_' + elt]].isnull().any(axis=1)]
df.drop(index_with_nan,axis=0, inplace=True)
msno.matrix(df)
<AxesSubplot:>
As we can see in the following output, in our new, cleaned data set, we no longer have any missing data:
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 38095 entries, 5396 to 53669791 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 name 38095 non-null object 1 host_id 38095 non-null int32 2 host_name 38095 non-null object 3 neighbourhood 38095 non-null object 4 latitude 38095 non-null float64 5 longitude 38095 non-null float64 6 room_type 38095 non-null object 7 price 38095 non-null int32 8 minimum_nights 38095 non-null int32 9 number_of_reviews 38095 non-null int64 10 reviews_per_month 38095 non-null float64 11 calculated_host_listings_count 38095 non-null int32 12 availability_365 38095 non-null int32 13 number_of_reviews_ltm 38095 non-null int32 14 host_is_superhost 38095 non-null object 15 accommodates 38095 non-null int64 16 bedrooms 38095 non-null float64 17 beds 38095 non-null float64 18 availability_60 38095 non-null int64 19 number_of_reviews_l30d 38095 non-null int64 20 availability_90 38095 non-null int64 21 review_scores_accuracy 38095 non-null float64 22 review_scores_cleanliness 38095 non-null float64 23 review_scores_checkin 38095 non-null float64 24 review_scores_communication 38095 non-null float64 25 review_scores_location 38095 non-null float64 26 review_scores_value 38095 non-null float64 dtypes: float64(11), int32(6), int64(5), object(5) memory usage: 7.3+ MB
An outlier of a dataset is defined as a value that is more than 3 standard deviations from the mean.
df.describe()
| host_id | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | ... | beds | availability_60 | number_of_reviews_l30d | availability_90 | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3.809500e+04 | 38095.000000 | 38095.000000 | 38095.000000 | 38095.000000 | 38095.000000 | 38095.000000 | 38095.000000 | 38095.000000 | 38095.000000 | ... | 38095.000000 | 38095.000000 | 38095.000000 | 38095.000000 | 38095.000000 | 38095.000000 | 38095.000000 | 38095.000000 | 38095.000000 | 38095.000000 |
| mean | 9.369813e+07 | 48.863862 | 2.345970 | 122.423783 | 90.069720 | 28.052815 | 0.809993 | 9.625856 | 100.677175 | 4.717181 | ... | 1.713112 | 13.537735 | 0.594277 | 23.171256 | 4.760658 | 4.596519 | 4.802792 | 4.813728 | 4.805951 | 4.618057 |
| std | 1.161688e+08 | 0.018109 | 0.032499 | 181.146475 | 160.968654 | 51.055294 | 1.197874 | 33.324839 | 132.730567 | 14.673778 | ... | 1.409655 | 19.494251 | 1.559542 | 31.384957 | 0.421444 | 0.531285 | 0.397959 | 0.400307 | 0.348554 | 0.464299 |
| min | 2.626000e+03 | 48.812580 | 2.229290 | 0.000000 | 1.000000 | 1.000000 | 0.010000 | 1.000000 | 0.000000 | 0.000000 | ... | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 1.246941e+07 | 48.850790 | 2.325840 | 62.000000 | 2.000000 | 4.000000 | 0.130000 | 1.000000 | 0.000000 | 0.000000 | ... | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 4.700000 | 4.460000 | 4.760000 | 4.790000 | 4.750000 | 4.500000 |
| 50% | 3.730673e+07 | 48.865110 | 2.348850 | 90.000000 | 3.000000 | 10.000000 | 0.380000 | 1.000000 | 7.000000 | 0.000000 | ... | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 4.880000 | 4.750000 | 4.930000 | 4.950000 | 4.920000 | 4.720000 |
| 75% | 1.362256e+08 | 48.878235 | 2.370150 | 137.000000 | 30.000000 | 30.000000 | 1.000000 | 2.000000 | 203.000000 | 5.000000 | ... | 2.000000 | 26.000000 | 0.000000 | 49.000000 | 5.000000 | 4.960000 | 5.000000 | 5.000000 | 5.000000 | 4.900000 |
| max | 4.346335e+08 | 48.904860 | 2.472030 | 9280.000000 | 9999.000000 | 1809.000000 | 50.860000 | 252.000000 | 365.000000 | 1705.000000 | ... | 90.000000 | 60.000000 | 62.000000 | 90.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 |
8 rows × 22 columns
Two features are not displayed in the table above: "accommodates" and "bedrooms"
df[["accommodates", "bedrooms"]].describe()
| accommodates | bedrooms | |
|---|---|---|
| count | 38095.00000 | 38095.000000 |
| mean | 3.06384 | 1.111799 |
| std | 1.62012 | 1.068073 |
| min | 0.00000 | 0.000000 |
| 25% | 2.00000 | 1.000000 |
| 50% | 2.00000 | 1.000000 |
| 75% | 4.00000 | 1.000000 |
| max | 16.00000 | 50.000000 |
f, ax = plt.subplots(figsize=(13, 7))
sns.boxplot(data=df, x='room_type', y='price', ax=ax)
ax.set_ylabel('price (€)')
ax.set_xlabel('room type')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
_ = f.suptitle('price by room_type', fontsize=20)
WARNING: We observe abnormally high prices for the categories "Entire_home/Apt" and "private_room". These outliers on the target variable prevent us from having a good Machine Learning model. We need to remove them.
NOTE: The outliers impact the value of the mean and standard deviation of the values. As our outlier selection algorithm uses these two quantities and as we perform only one selection step, there will be outliers after our treatment but the maximum price obtained will not penalize future processing.
target = df['price']
# Deletion of rows with a negative or zero price
condition_positive_price = target > 0
df = df[condition_positive_price]
# Selection of outliers for price by room type
filtered_entries = []
for room_type in df['room_type'].unique():
condition = df['room_type'] == room_type
df_room_filtered = df[condition]
z_scores = stats.zscore(df_room_filtered['price'])
abs_z_scores = np.abs(z_scores)
# filter the values less than 3 standard deviations from the mean
filtre = abs_z_scores < 3
filtered_entries.append(filtre)
rows_to_keep = pd.concat(filtered_entries)
rows_to_keep.sort_index(inplace=True)
nb_outliers = target.shape[0] - rows_to_keep.sum()
text_to_print = "{} appartments out of a total of {} (i.e. {:.2%}) are removed as outliers" \
.format(nb_outliers, target.shape[0], nb_outliers/target.shape[0])
print(text_to_print)
296 appartments out of a total of 38095 (i.e. 0.78%) are removed as outliers
df = df[rows_to_keep]
df.shape
(37799, 27)
df['price'].describe()
count 37799.000000 mean 113.616366 std 83.096892 min 8.000000 25% 61.000000 50% 90.000000 75% 135.000000 max 663.000000 Name: price, dtype: float64
f, ax = plt.subplots(figsize=(13, 7))
sns.boxplot(data=df, x='room_type', y='price', ax=ax)
ax.set_ylabel('price (€)')
ax.set_xlabel('room type')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
_ = f.suptitle('price by room_type', fontsize=20)
f, ax = plt.subplots(figsize=(13, 7))
sns.violinplot(data=df, x='room_type', y='price', ax=ax)
ax.set_ylabel('price (€)')
ax.set_xlabel('room type')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
_ = f.suptitle('price by room_type', fontsize=20)
NOTE: The average price and the distribution of prices around this average value are very different according to the variable room_type
from pandas_profiling import ProfileReport
# Caution: Profiling the df dataset is time consuming
profile = ProfileReport(df, title='df Profiling Report', html={'style':{'full_width':True}}, \
vars = {'cat': {'check_composition' : False}})
profile.to_notebook_iframe()